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Abstract 

Gaussian elimination with full pivoting generates a PLUQ matrix de- 
composition. Depending on the strategy used in the search for pivots, the 
permutation matrices can reveal some information about the row or the 
column rank profiles of the matrix. We propose a new pivoting strategy 
that makes it possible to recover at the same time both row and column 
rank profiles of the input matrix and of any of its leading sub-matrices. 
We propose a rank-sensitive and quad-recursive algorithm that computes 
the latter PLUQ triangular decomposition of an m x n matrix of rank r 
in O (mnr"~^) field operations, with lu the exponent of matrix multipli- 
cation. Compared to the LEU decomposition by Malashonock, sharing a 
similar recursive structure, its time complexity is rank sensitive and has 
a lower leading constant. Over a word size finite field, this algorithm also 
improveLs the practical efficiency of previously known implementations. 

1 Introduction 

Triangular matrix decomposition is a fundamental building block in com- 
putational linear algebra. It is used to solve linear systems, compute the 
rank, the determinant, the nullspace or the row and column rank profiles 
of a matrix. The LU decomposition, defined for matrices whose leading 
principal minors are all nonsingular, can be generalized to arbitrary di- 
mensions and ranks by introducing pivoting on sides, leading e.g. to the 
LQUP decomposition of [6] or the PLUQ decomposition [5, 8]. Many al- 
gorithmic variants exist allowing fraction free computations [8], in-place 
computations [2, 7] or sub-cubic rank-sensitive time complexity [II, 7]. 
More precisely, the pivoting strategy reflected by the permutation matri- 
ces P and Q is the key difference between these PLUQ decompositions. 
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In numerical linear algebra [5], pivoting is used to ensure a good numer- 
ical stability, good data locality, and reduce the fill-in. In the context of 
exact linear algebra, the role of pivoting differs. Indeed, only certain piv- 
oting strategies for these decompositions will reveal the rank profile of the 
matrix. The latter is crucial in many applications using exact Gaussian 
elimination, such as Grobner basis computations [4] and computational 
number theory [10]. 

The row rank profile of an m x n matrix with rank r is a lexicographi- 
cally smallest sequence of r row indices such that the corresponding rows 
of the matrix are linearly independent. Similarly the column rank profile 
is a lexicographically smallest sequence of r column indices such that the 
corresponding rows of the matrix are linearly independent. 

The common strategy to compute the row rank profile is to search for 
pivots in a row-major fashion: exploring the current row, then moving to 
the next row only if the current row is zero. Such a PLUQ decomposition 
can be transformed into a CUP decomposition (where C = PL is in col- 
umn echelon form) and the first r values of the permutation associated to 
P are exactly the row rank profile [7] . A block recursive algorithm can be 
derived from this scheme by splitting the row dimension [6] . Similarly, the 
column rank profile can be obtained in a column major search: exploring 
the current column, and moving to the next column only if the current 
one is zero. The PLUQ decomposition can be transformed into a PLE 
decomposition (where E = UQ is in row echelon form) and the first r 
values of Q are exactly the column rank profile [7]. The corresponding 
block recursive algorithm uses a splitting of the column dimension. 

Recursive elimination algorithms splitting both row and column di- 
mensions include the TURBO algorithm [3] and the LEU decomposi- 
tion [9]. No connection is made to the computation of the rank profiles 
in any of them. The TURBO algorithm does not compute the lower tri- 
angular matrix L and performs five recursive calls. It therefore implies 
an arithmetic overhead compared to classic Gaussian elimination. The 
LEU decomposition aims at reducing the amount of permutations and 
therefore also uses many additional matrix products. As a consequence 
its time complexity is not rank-sensitive. 

We propose here a pivoting strategy following a Z-curve structure and 
working on an incrementally growing leading sub-matrix. This strategy is 
first used in a recursive algorithm splitting both rows and columns which 
recovers simultaneously both row and column rank profiles. Moreover, the 
row and column rank profiles of any leading sub-matrix can be deduced 
from the P and Q permutations. We show that the arithmetic cost of 
this algorithm remains rank sensitive of the form 0{mnr"~^) where uj is 
the exponent of matrix multiplication. The best currently known upper 
bound for to is 2.3727 [12]. As for the CUP and PLE decompositions, 
this PLUQ decomposition can be computed in-place. We also propose an 
iterative variant, to be used as a base-case. 

Compared to the CUP and PLE decompositions, this new algorithm 
has the following new salient features: 

• it computes simultaneously both rank profiles at the cost of one, 

• it preserves the squareness of the matrix passed to the recursive calls. 
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thus allowing more efficient use of the matrix multiplication building 
block, 

• it reduces the number of modular reductions in a finite field, 

• a CUP and a PLE decompositions can be obtained from it, with row 
and column permutations only. 

Compared to the LEU decomposition, 

• it is in-place, 

• its time complexity bound is rank sensitive and has a better leading 
constant, 

• a LEU decomposition can be obtained from it, with row and column 
permutations. 

In Section 2 we present the new block recursive algorithm. Section 3 
shows the connection with the LEU decomposition and section 4 states 
the main property about rank profiles. We then analyze the complexity of 
the new algorithm in terms of arithmetic operations: first we prove that it 
is rank sensitive in Section 5 and second we show in section 6 that, over a 
finite field, it reduces the number of modular reductions when compared 
to state of the art techniques. We then propose an iterative variant in 
Section 7 to be used as a base-case to terminate the recursion before the 
dimensions get too small. Experiments comparing computation time and 
cache efficiency are presented in section 8. 

2 A recursive PLUQ algorithm 

We first recall the name of the main sub-routines being used: MM stands 
for matrix multiplication, TRSM for triangular system solving with matrix 
unknown (left and right variants are implicitly indicated by the param- 
eter list), PermC for matrix column permutation, PermR for matrix row 
permutation, etc. For instance, we will use: 

MM(C, A, B) to denote C ^ C - AB, 

TRSM(f/, B) for B ^ U~^B with U upper triangular, 

TRSM(B, L) for B BL^^ with L lower triangular. 

We also denote by Tk,i the transposition of indices k and I and by L\U, 
the storage of the two triangular matrices L and U one above the other. 
Further details on these subroutines and notations can be found in [7]. In 
block decompositions, we allow for zero dimensions. By convention, the 
product of any m x matrix by an x n matrix is the mx n zero matrix. 

We now present the block recursive algorithm 1, computing a PLUQ 
decomposition. 
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Algorithm 1 PLUQ 

Input: A — (aij) a m X n matrix over a field 
Output: P,Q'. m X m and n x n permutation matrices 
Output: r: the rank of A 



Output: A 



L\U V 
M 



upper triangular, and 



where L is r x r unit lower triangular, U is r x r 
[U V]Q. 



A: 



if m=l then 

ifA=[0 ... 0] thenP4- [1] ,Q^/„,r^O 
else 

i -s— the col. index of the first non zero clt. of A 

[1] ;Q^ri,„r^ 1 
Swap ai^i and ai.i 
end if 

Return (P, Q, r, A) 
end if 
if n=l then 

ifA=[0 ... 0]"^ then P ^ /™;Q ^ [1] ,r ^ 
else 

i -s— the row index of the first non zero elt. of A 

[1] ,g^ri,„r^ 1 

Swap fli^i and ai.i 

for j = i + 1 . . .m do aj^i -S— aj^iai\ 
end for 
end if 

Return (P, Q, r, A) 
end if 

> the trailing parts of the algorithm are shown on next pages 



It is based on a splitting of the matrix in four quadrants. A first recur- 
sive call is done on the upper left quadrant followed by a series of updates. 
Then two recursive calls can be made on the anti-diagonal quadrants if 
the first quadrant exposed some rank deficiency. After a last series of 
updates, a fourth recursive call is done on the bottom right quadrant. 
Figure 1 illustrates the position of the blocks computed in the course of 
algorithm 1, before and after the final permutation with matrices S and T. 

This framework differs from the one in [3] by the order in which the 
quadrants are treated, leading to only four recursive calls in this case 
instead of five in [3] . We will show in section 4 that this fact together with 
the special form of the block permutations S and T makes it possible to 
recover rank profile information. The correctness of algorithm 1 is proven 
in appendix A. 

Remark 1. Algorithm 1 is in-place (as defined in [7, Definition 1]): all 
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where Ai is J x [fj. 

o PLUQ(^i) 

PerinR(^2,f'f ) 
i> PerinC(^3,Qf) 



> TRSM(Li,Si) 

> TRSM(Gi,/7i) 
o MM(B2,Mi,D) 

> m{C2,E,Vi) 
MM(A4,£;,£)) 



i> PLUq(F) 
PLUQ(G) 

o PerinR(iI, Pj); PermC(i?, Ql) 
o PermR(£', Pg^) 

> PermR(Mi,P2^) 

> PerniR(D, Ql) 
o PerinR(Vi,Q|') 
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Return (P, Q, ri + r2 + rs + Ti, A) 
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Figure 1: Block recursive Z-curvc PLUQ decomposition and final block permu- 
tation. 



operations of the TRSM, MM, PermC, PermR subroutines work with 0(1) extra 
memory allocations except possibly m the course of fast matrix multipli- 
cations. The only constraint is for the computation of J -ir- LJ^I which 
would overwrite the matrix I that should be kept for the final output. Hence 
a copy of I has to be stored for the computation of J. The matrix I has di- 
mension r-j, X r2 and can be stored transposed in the zero block of the upper 
left quadrant (of dimension (y — ri) x — ri), as shown on Figure 1). 



3 From PLUQ to LEU 



We now show how to compute the LEU decomposition of [9] from the 
PLUQ decomposition. The idea is to write 



= P 



L 



Ir 



U V 

In — 7 



Q 



and show that L and U are respectively lower and upper triangular. This 
is not true in general, but turns out to be satisfied by the P, L, M, U, V 
and Q obtained in algorithm 1. 

"L" 



Theorem 1. Let A 



M 



\UV] Q be the PLUQ decomposition com- 



puted by algorithm 1. Then for any unit lower triangular matrix Y and 



any upper triangular matrix Z , the matrix P 



L 

MY 



P IS unit lower tri- 



angular and ' 



UV 
Z 



is upper triangular. 



Proof. Proceeding by induction, we assume that the theorem is true on all 

_P^ 

Yj 



four recursive calls, and show that it is true for the matrices P[^j y 



and Q'^[^'^]Q. Let Y = 



YoYa 



where Yi is unit lower triangular of 
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dimension fc — ri — r2. From the correctness of algorithm 1 (see e.g. 

Mil L2 
M12M2Y1 



Equation A), S 



L 
MY 



Hence P 



L 

MY 



El I L3 
E21 Ki M31 L4 
E22 K2Y2M32M4Y3 
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Pi 



Pa 
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E22 K2Y2M32M4,Y3^ 

pT 



By induction hypothesis, the matrices L2 ~ P2 



L2 

M2Y1 



Pi ,Li = Pi 



Li 

MiY3 



, Pi 



Li_ 
M1L2 



and P3 



L3_ 

MsLi 



P3 are unit lower triangular. Therefore 



the matrix P[^f y] P^ ^-l^o unit lower triangular. 



Similarly, let Z 



- ri 



Z1Z2 
Z3 



r2. The matrix 



where Zi is upper triangular of dimension 
T equals 
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UiVii 
U3 V3 
Zi 
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U2 V21 V22 
Ui Vi 

Z3 



Q3 
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By induction hypothesis, the matrices U-j, — Ql 



U3V3 
^1 



3, (74 = Qa 



U4V4 
Z3 



, QT 



Q2 are upper triangular. Consequently 



'U^^An A nT \U2V2 
jj- Qi and Q2 jy 

the matrix is upper triangular. 

For the base case with m = 1. The matrix L has dimension 1x1 and 
is unit lower triangular. If r = 0, then U — I^ZI„ is upper triangular. 
If r = 1, then Q — Ti^i where i is the column index of the pivot and is 
therefore the column index of the leading coefficient of the row [UV] Q. 
Applying on the left only swaps rows 1 and i, hence row [UV] Q is 



the ith row of Q'' 



UV 
Z 



Q. The latter is therefore upper triangular. The 



same reasoning can be applied to the case n — 1. 



Corrolary 1. Let L — P 



L 

MI„ 



P\E 







and U 



□ 

UV' 




Then A — LEU is a LEU decomposition of A. 

Remark 2. The converse is not always possible: gtven A = L, E, U , there 
are several ways to choose the last m — r columns of P and the last n — r 
rows of Q. The LEU algorithm does not keep track of these parts of the 
permutations. 



4 Computing the rank profiles 

We prove here the main feature of the PLUQ decomposition computed 
by algorithm 1: it reveals the row and column rank profiles of all leading 
sub-matrices of the input matrix. We recall in Lemma 1 basic properties 
verified by the rank profiles. 

Lemma 1. For any matrix, 

1. the row rank profile is preserved by right multiplication with an in- 
vertible matrix and by left multiplication with an invertible upper 
triangular matrix. 

2. the column rank profile is preserved by left multiplication with an 
invertible matrix and by right multiplication with an invertible lower 
triangular matrix. 

Lemma 2. Let A — PLUQ be the PLUQ decomposition computed by al- 
gorithm 1. Then the row (resp. column) rank profile of any leading (k,t) 
submatrix of A is the row (resp. column) rank profile of the leading (k,t) 



submatrix of P 







Proof. With the notations of corollary 1, we have: 



A = P 



L 

Mir, 



Ir 



U V 



= LP 



Ir 



QU 



Hence 



f/fcOl A 



It 




Li /fcO P 



QUi, 
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where Li is the k x k leading submatrix of L (hence it is an invertible 
lower triangular matrix) and Ui is the txt leading submatrix of U (hence 
it is an invertible upper triangular matrix). Now, Lemma 1 implies that 





the rank profile of f/fcOl A 



is that of JfcO P 



'Ir ■ 




'It 





Q 






□ 



From this lemma we deduce how to compute the row and column rank 
profiles of any {k, t) leading submatrix and more particularly of the matrix 
A itself. 

Corrolary 2. Let A — PLUQ he the PLUQ decomposition of a m x n 
matrix computed by algorithm 1. The row (resp. column) rank profile of 
any {k, t) -leading submatrix of a A is the sorted sequence of the row (resp. 
column) indices of the non zero rows (resp. columns) in the matrix 



R = IM P 







Q 



Corrolary 3. The row (resp. column) rank profile of A is the sorted se- 
quence of row (resp. column) indices of the non zero rows (resp. columns) 
of the first r columns of P (resp. first r rows of Q). 



5 Complexity analysis 

We study here the time complexity of algorihtm 1 by counting the number 
of field operations. For the sake of simplicity, we will assume here that 
the dimensions m and n are powers of two. The analysis can easily be 
extended to the general case for arbitrary m and n. 

For i = 1, 2, 3, 4 we denote by Ti the cost of the i-th recursive call to 
PLUQ, on a ^ X ^ matrix of rank r^. We also denote by rTRSM(j7i,n) the 
cost of a call TRSM on a rectangular matrix of dimensions m x n, and by 
Tmijn, k, n) the cost of multiplying an m x fc by an fc x n matrix. 

Theorem 2. Algorithm 1, run on an m x n matrix of rank r, performs 
O (mnr'^^ ) field operations. 

Proof. Let T = Tpldq(?71, n, r) be the cost of algorithm 1 run on a m x n 
matrix of rank r. From the complexities of the subroutines given, e.g., 
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in [2] and the recursive calls in algorithm 1, we have: 

TTl 7% 

T=Ti +T2+T3 + T4 + TTBsniri, — ) + Tissniri, -) 
+iTRSM(r-2, + -/iRSMira, - j + Jmh(-^ - ri,ri, -) 
+Tmm( — ,7-1,- -ri) + rm( — ,ri,-) 

Til TXXi Tl 

+r„„(r3,r2, - - r2) +Tm(— - r3,r2, - - r2 - r4) 
+T„m(— - r3,r-3, - - r2 - r4) 

mn „„2 , mn „_2 mn j^_2 

+?'3 ) H r, H To H r.> 

^ ^ 22 22 22"^ 

<ri + T2 + r3 + r4 + A-'mnr'"-^ 

for some constants K and A"' (we recall that a'^~^ + < 2"^~"(a + 
for 2 < ij < 3). 

Let C = max{ ^_^-2^ ; Ij.Then we can prove by a simultaneous in- 
duction on m and n that T < Cmnr'^~^ . 

Indeed, if (r = 1, m = 1, n > m) or (r = l,n = > ?i) then 

T < m — 1 < Cmnr"~^ . Now if it is true for m = 2-',n = 2', then for 
m = 2^+\n = 2*+\ we have 

^ ^ / cij — 2 I uj — 2 I ij} — 2 I ti; — 2\ . r^/ lj — 2 

l<—mn(ri + ^2 + +^4 ) + -K mnr 

/^/r)3 — Lt; \2 

< — ^ —mnr + K mnr 

~ 4 

24-2u 

<_?ir' — —mnr"~^ -\- K'mnr'^'^^ < Cmnr'^^^ . 

- l_ 24-2t^ - 

□ 

In order to compare this algorithm with usual Gaussian elimination 
algorithms, we now refine the analysis to compare the leading constant 
of the time complexity in the special case where the matrix is square and 
has a generic rank profile: ri = ^ = ^ , r2 = 0, rs = and ^'4 = ^ = f 
at each recursive step. 

Hence we have 

JpLOQ— zJpLuql,-^, 2 ' 1^) + zJtrsmI,-^, 2 i + -'mmV"^i 2 ' 2"' 
=2rpLuq(|, ^, |) + 22„f;"_ 2 (i) +^"(1 
Writing Tpu]ii{n,n,n) = an", the constant a satisfies: 
1/1 \ '^"-2 

« = <::•„— — - — = — - + 1 = a 



(2" - 2) V2"-2 - 1 J (2^^ - 2)(2"-2 - 1) ■ 

which is equal to the constant of the CUP and LUP decompositions [7, 
Table 1]. In particular, it equals 2/3 when uj — 3,Cui = 2, matching the 
constant of the classical Gaussian elimination. 
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6 Number of modular reductions over a 
prime field 

In the following we suppose that the operations are done with full delayed 
reduction for a single multiplication and any number of additions: opera- 
tions of the form aibi are reduced only once at the end of the addition, 
but a ■ b ■ c requires two reductions. In practice, only a limited amount 
of accumulations can be done on an actual mantissa without overflowing, 
but we neglect this in this section for the sake of simplicity. See e.g. [2] 
for more details. For instance, with this model, the number of reductions 
required by a classic multiplication of matrices of size m x fc by fc x n is 
simply: m ■ n. We denote this by RMM{m, k, n) = mn. This extends e.g. 
also for triangular solving: 

Theorem 3. Over a prime field modulo p, the number of reductions mod- 
ulo p required by TRSM{m, n) with full delayed reduction is: 

RunitTRSM{m,n)=mn if the triangular matrix is unitary, 
RTRSM(rn,n) —2mn in general. 

Proof. If the matrix is unitary, then a fully delayed reduction is required 
only once after the update of each row of the result. In the generic case, 
we invert each diagonal element first and multiply each element of the 
right hand side by this inverse diagonal element, prior to the update of 
each row of the result. This gives mn extra reductions. □ 

Next we show that the new pivoting strategy is more efficient in terms 
of number of integer division. 

Theorem 4. Over a prime field modulo p and on a full-rank square m x 
m matrix with generic rank profile, and m a power of two, the number 
of reductions modulo p required by the elimination algorithms with full 
delayed reduction is: 

Rpm{m, m) =2m^ + o (m^) , 

RpLE{rn,m) = Rcup{m,m) = [l + \ log2(m)) + o (m^) 

Proof. If the top left square block is full rank then PLUQ reduces to one 
recursive call, two square TRSM (one unitary, one generic) one square ma- 
trix multiplication and a final recursive call. In terms of modular reduc- 
tions, this gives: JJplUqM = 2J?pLUQ(f ) + i?UnitTRSH(f , f )+-RTRSM(f , f ) + 

^mm(^) T) t)- Therefore, using theorem 3, the number of reductions 
within PLUQ satisfies T(m) = 2T{^) -f so that it is RpLUQ{m,m) = 
2m^ — 2m if m is a power of two. 

For row or column oriented elimination this situation is more com- 
plicated since the recursive calls will always be rectangular even if the 
intermediate matrices are full-rank. We in fact prove, by induction on m, 
the more generic: 

2 

i?PLE(m, nj = logjam) (-^ —)+ m +o[mn + m) (1) 

First -Rple(1,'^) = since [1] x [ai, . . . , a„] is a triangular decomposition 
of the 1 X n matrix [ai , . . . , a„] . Now suppose that Equation 1 holds 
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for k — m. Then we follow the row oriented algorithm of [2, Lemma 5.1] 
which makes two recursive calls, one TRSM and one MM to get _RpLE(2m, n) = 
-RpLE(fn, n)+i?PLE(m,, m)+Rm{m, m, n~m)+Rpi,E{m, n~m) = R?LE{ra, n) + 
RpLEim, n — m) + m{n + m). We then apply the induction hypothesis on 
the recursive calls to get 

1 1 2 2 

-RpLE (2m, n) = - log2 (m)mn log2 (m)m +m + 

1 1 2 2 

- log2 (m)m(n - m) - -loga (m)m +m + 

m(n + m) + o {mn + m^) 
=log2(m)(mn — m^) + 3m^ + mn + o (mn + m^) . 

The latter is also obtained by substituting k ^ 2m in Equation 1 so that 
the induction is proven. □ 

This show that the new algorithm requires much less modular reduc- 
tions, as soon as m is larger than 32. Over finite fields, since reductions 
can be much more expensive than multiplications or additions by elements 
of the field, this is a non negligible advantage. We show in the next sec- 
tion that this participates to the better practical performance of the PLUQ 
algorithm. 



7 A base case algorithm 

We propose in algorithm 2 an iterative algorithm computing the same 
PLUQ decomposition as algorithm 1. The motivation is to offer an alter- 
native to the recursive algorithm improving the computational efficiency 
on small matrix sizes. Indeed, as long as the matrix fits the cache mem- 
ory, the amount of page faults of the two variants are similar, but the 
iterative algorithm reduces the amount of row and column permutations. 
The block recursive algorithm can then be modified so that it switches 
to the iterative algorithm whenever the matrix dimensions are below a 
certain threshold. 

Unlike the common Gaussian elimination, where pivots are searched 
in the whole current row or column, the strategy is here to proceed with 
an incrementally growing leading sub-matrix. This implies a Z-curve type 
search scheme, as shown on figure 2. This search strategy is meant to 
ensure the properties on the rank profile that have been presented in 
section 4. 
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Figure 2: Iterative base case PLUQ decomposition 



Algorithm 2 PLUQ iterative base case 



Input: A a, m X n matrix over a field 

Output: P,Q: m x m and n x n permutation matrices 

Output: r; the rank of A 

\l\uv' 



Output: A 



M 



where L \s r x r unit lower triang., U \s r x r upper 



triang. and such that A — P 



L 
M 



[UV] Q. 



.A. 



2 — l,r I 



r 0; i ^ 0; j ^ 
while i < m or j < n do 

> Let V — [Ai^r- ■ -Ai^j-i] and w — [Arj. 
ii j < n and w then 

p row index of the first non zero entry in w 

Q ^ j;j ^ max(j + l,n) 
else if i < TO and w ^ then 

q column index of the first non zero entry in v 

p i; i ■<r- max(i + 1, to) 
else if I < TO and j < n and Ai,j ^ then 

{p,q) ^ 

i max(i + 1, to); j ^ max(j' + 1, n) 
else 

i max(i + 1,to);j ^ max(j' + l,n) 
continue 

end if > At this stage, Ap^q is a pivot 

for k — p + 1 . . . n do 

^fe.q ^ Ak,p/ Ap^q 

Ak^q+1...7i Ak^q+l...n — ^fc,(j^p,g+l...n 

end for 

Ar+i...m,r+i ^ ^r+i...m,q > Swap pivot columu 

^r+i.r+i...n ^ ^p,r+i...n » Swap pivot row 

P <— Tp rP', Q ^ QTq r 14 > Tfe / swaps indices fc and I 

r -s— r + 1 
end while 



Remark 3. In order to further improve the data locality, this iterative 
algorithm can be transformed into a left-looking variant [1]. We did not 
write this version here for the sake of clarity, hut this is how we imple- 
mented the base case for the experiments of section 8. 

8 Experiments 

We present here experiments comparing an implementation of algorithm 1 
computing a PLUQ decomposition against the implementation of the 
CUP /PLE decomposition, called LUdivine in the FFLAS-FFPACK library^ . 
The new implementation of the PLUQ decomposition is available in this 
same library from version svn@346. We ran our tests on a single core of 
an Intel Xeon E5-4620@2.20GHz using gcc-4.7.2. 

Figures 3 and 4 compare the computation time of LUdivine, and the 
new PLUQ algorithm. In figure 3, the matrices are dense, with full rank. 
The computation times are similar, the PLUQ algorithm with base case 
showing a slight improvement over LUdivine. In figures 4 and 5, the 




5000 10000 15000 20000 

matrix dimension 



Figure 3: Computation time with dense full rank matrices over Z/1009Z. 

matrices are square, dense with a rank equal to half the dimension. To 
ensure non trivial row and column rank profiles, they are generated from 
a LEU decomposition, where L and U are uniformly random non-singular 
lower and upper triangular matrices, and E is zero except on r = n/2 
positions, chosen uniformly at random, set to one. The cutoff dimension 
for the switch to the base case has been set to an optimal value of 30 by 
experiments. Figure 4 shows how the base case greatly improves the effi- 
ciency for PLUQ, presumably for it reduces the number of row and column 

^http : / /linalg. org/ff las- ff pack 
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Figure 4: Computation time with dense rank deficient matrices (rank is half the 
dimension) 



permutations. With the base case the computation time is comparabie 
to LUdivine. More preciseiy, PLUQ becomes faster than LUDivine for 
dimensions above 9000. Figure 5 shows that, on larger matrices, PLUQ 
can be about 10% faster than LUdivine. 

Table 1 summarizes some of the data reported by the callgrind tool 
of the valgrind emulator (version 3.8.1) concerning the cache misses. We 
also report in the last column the corresponding computation time on 
the machine (without emulator). The matrices used are the same as in 
figure 4, with rank half the dimension. We first notice the impact of the 
base case on the PLUQ algorithm: although it does not change the number 
of cache misses, it strongly reduces the total number of memory accesses 
(less permutations), thus improving the computation time. Now as the 
dimension grows, the total amount of memory accesses and the amount of 
cache misses plays in favor of PLUQ which becomes faster than LUdivine. 



9 Conclusion and perspectives 

The decomposition that we propose can first be viewed as an improve- 
ment over the LEU decomposition, introducing a finer treatment of rank 
deficiency that reduces the number of arithmetic operations, makes the 
time complexity rank sensitive and allows to perform the computation 
in-place. 

Second, viewed as a variant of the existing CUP/PLE decompositions, 
this new algorithm produces more information on the rank profile and has 
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matrix dimension 



Figure 5: Computation time with dense rank deficient matrices of larger dimen- 
sion 



better cache efficiency, as it avoids calling matrix products with rectan- 
gular matrices of unbalanced dimensions. It also performs fewer modular 
reductions when computing over a finite field. 

Overall the new algorithm is also faster in practice than previous im- 
plementations when matrix dimensions get large enough. 

Now, in a parallel setting, it should exhibit more parallelism than row 
or column major eliminations since the recursive calls in step 2 and 3 are 
independent. This is also the case for the TURBO algorithm of [3], but 
the latter requires more arithmetic operations. Further experiments and 
analysis of communication costs have to be conducted in shared and dis- 
tributed memory settings to assess the possible practical gains in parallel. 
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A Correctness of algorithm 1 



First note that S 
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